EDA of Cardiovascular diseases

1. Objective:

This dataset describes cardiovascular disease in patients through several attributes. It consists of 70,000 patient records with 12 features, such as age, gender, systolic blood pressure, and diastolic blood pressure. The target class "cardio" equals 1 when the patient has cardiovascular disease and 0 when the patient does not. All values were collected at the time of the medical examination.

There are a total of 12 fields related to cardiovascular disease. We try to find out which of them are associated with cardiovascular disease in the sample. The stakeholders for this study are patients, for their well-being and a healthy life, and pharmaceutical companies, to help them in drug manufacturing.

We are the founder and CEO of a US cardiovascular disease center, and we want to dedicate our research to giving back to the community and to the well-being of patients. Cardiovascular diseases are very dangerous if not diagnosed in time, and without proper care after diagnosis they can become life-threatening. Our organization will therefore help the public estimate the probability of having heart disease from a patient's features. Our report could include counts for each type of feature related to cardiovascular disease. We plan to study this by grouping the patients into "With" and "Without" heart disease and analysing which features in the given data are most strongly correlated with heart disease.

2. Data Exploration:

2.1: Display top 5 rows of data

To understand and explore the dataset, read the available cardio_train.csv file using pandas and display the top 5 rows of the data.
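A minimal sketch of this step follows. The helper name load_cardio and the inline sample rows are illustrative, not taken from the original notebook, and the semicolon separator is an assumption about how cardio_train.csv is delimited. Adjust sep if your copy differs.

```python
import io
import pandas as pd

# Small made-up sample standing in for cardio_train.csv (illustrative values only).
# Assumption: the real file uses ';' as its separator.
SAMPLE_CSV = """id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio
0;18393;2;168;62.0;110;80;1;1;0;0;1;0
1;20228;1;156;85.0;140;90;3;1;0;0;1;1
2;18857;1;165;64.0;130;70;3;1;0;0;0;1
3;17623;2;169;82.0;150;100;1;1;0;0;1;1
4;17474;1;156;56.0;100;60;1;1;0;0;0;0
5;21914;1;151;67.0;120;80;2;2;0;0;0;0
"""

def load_cardio(source, sep=";"):
    """Read the cardio data from a file path or file-like object."""
    return pd.read_csv(source, sep=sep)

# With the real file this would be: df = load_cardio("cardio_train.csv")
df = load_cardio(io.StringIO(SAMPLE_CSV))
print(df.head())  # head() shows the first 5 rows by default
```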

Observations from output: The dataset has details about the patients such as ID, age (in days), gender (1: women, 2: men), height (in cm), weight (in kg), systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), cholesterol group (1: normal, 2: above normal, 3: well above normal), glucose level (1: normal, 2: above normal, 3: well above normal), smoking, alcohol intake and physical activity of the patient (0: no, 1: yes), and the target variable, cardio (0: no, 1: yes).

2.2: Check the total number of rows and column in the dataset

Observations from output: The dataset has a total of 70000 rows and 13 columns.

2.3: Describe the dataset

Observations from output: The average age of the patients in the sample is 19468 days, the average height is 164.35 cm and the average weight is 74 kg. The means of cholesterol and glucose are 1.3 and 1.22 respectively, which are close to 1 ("Normal"), indicating that most of the patients in the sample have normal cholesterol and glucose levels. The means of smoke and alcohol are less than 0.5, revealing that most patients are non-smokers and do not drink alcohol. The mean of active is greater than 0.5, which means most of the patients work out for their health. The mean of cardio is 0.49, close to 0.5, showing that about 50% of the patients in the sample are affected by heart disease. There seem to be abnormally high and low values in ap_hi and ap_lo, which should be taken care of while cleaning the data.

2.4: Find the data type of each column in dataset

Observations from output: All the columns in the dataset are numerical. Most of them are of int64 type, except one, which is of float type.
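The checks in 2.2 to 2.4 can be sketched together as follows. The small dataframe is a made-up stand-in with only a few of the columns, so the printed numbers will not match the observations above.

```python
import pandas as pd

# Made-up stand-in for the loaded dataset (a handful of columns only).
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "age": [18393, 20228, 18857, 17623],      # age in days
    "weight": [62.0, 85.0, 64.0, 82.0],       # the one float column
    "ap_hi": [110, 140, 130, 150],
    "cardio": [0, 1, 1, 1],
})

n_rows, n_cols = df.shape     # 2.2: number of rows and columns
summary = df.describe()       # 2.3: count, mean, std, min, quartiles, max per column
print(n_rows, n_cols)
print(summary)
print(df.dtypes)              # 2.4: data type of each column
```

For a 0/1 column such as cardio, the mean in the describe() output is the fraction of rows equal to 1, which is how the "about 50%" reading above is obtained.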

3. Data Cleaning:

As part of data cleaning, check for rows with NA values.

Observations from output: There are no rows with NA values in the dataset.

Observations from output: After removing rows with abnormally high and low (even negative) values of systolic blood pressure, 69772 rows remained in the dataset.

Observations from output: After removing rows with abnormally high and low (even negative) values of diastolic blood pressure, 68747 rows remain for the data analysis.
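A hedged sketch of the cleaning step, using the ranges stated in the project report (30 to 250 for ap_hi, 30 to 150 for ap_lo). The function name drop_bp_outliers and the sample rows are hypothetical, and treating the range ends as inclusive is an assumption.

```python
import pandas as pd

# Made-up rows, including the kinds of abnormal values described above.
df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "ap_hi": [110, 140, -100, 16020, 120, 130],
    "ap_lo": [80, 90, 70, 80, 1000, 0],
})

def drop_bp_outliers(frame, hi_range=(30, 250), lo_range=(30, 150)):
    """Keep rows whose ap_hi and ap_lo both fall inside the given ranges."""
    # Series.between is inclusive at both ends by default (an assumption here).
    keep = frame["ap_hi"].between(*hi_range) & frame["ap_lo"].between(*lo_range)
    return frame[keep].reset_index(drop=True)

# Count rows that contain at least one missing value.
rows_with_na = int(df.isna().any(axis=1).sum())
clean = drop_bp_outliers(df)
print("rows with NA:", rows_with_na)
print(len(df), "rows before,", len(clean), "rows after")
```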

4. Adaptation:

4.1 Creating a new column to convert age(in days) to age in years:

Observations from output: A new column age_years is created holding values of age in years. This column can be used to find if people tend to have higher risk of heart disease with age.

4.2 Creating a new column to find BMI of patients:

Observations from output: A new column bmi is created by using weight(kg)/(height(m)* height(m)) of each patient. This column can be used to find if people with higher BMI are at higher risk of heart disease.
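The two derived columns from 4.1 and 4.2 can be sketched as follows. Dividing by 365.25 days per year is an assumption, since the notebook does not state how the conversion was done. The sample rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [18393, 20228],      # age in days
    "height": [168, 156],       # height in cm
    "weight": [62.0, 85.0],     # weight in kg
})

# 4.1: age in years (assumption: 365.25 days per year, not rounded).
df["age_years"] = df["age"] / 365.25

# 4.2: BMI = weight (kg) / height (m)^2, converting height from cm to m first.
height_m = df["height"] / 100
df["bmi"] = df["weight"] / (height_m ** 2)

print(df[["age_years", "bmi"]].round(2))
```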

Visualization

5. Descriptive Visualization:

5.1 Plotting a bar chart to visualize gender vs count of cardiovascular diseased.

Observation from bar chart: Among participants without heart disease, the count of women is higher than the count of men, which at first suggests that women are healthier than men. However, among participants with heart disease the count of women is also higher than the count of men. The likely reason for this apparent contradiction is that the dataset contains more women than men, so raw counts cannot be compared directly across genders.
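One way to separate the two effects is to compare rates within each gender instead of raw counts. The sketch below uses made-up rows in which women outnumber men, so the counts differ while the disease rate is identical. It illustrates the reasoning only and says nothing about the rates in the real data.

```python
import pandas as pd

# Made-up sample: 6 women (gender 1) and 4 men (gender 2).
df = pd.DataFrame({
    "gender": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "cardio": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
})

# Raw counts per gender and cardio value (what the bar chart shows).
counts = pd.crosstab(df["gender"], df["cardio"])

# Share of each cardio value within a gender; each row sums to 1.
rates = pd.crosstab(df["gender"], df["cardio"], normalize="index")

print(counts)
print(rates)
```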

5.2 Plotting a histogram for age(in years) column.

Observation from histogram: Patients in the 50-60 years age group are the most numerous in the dataset used for analysis. The dataset was probably collected this way to check whether people in this age group have a higher chance of cardiovascular disease.

5.3 Plotting a pie-chart on percentage of people in each cholesterol group.

Observations from Pie-Chart: This pie chart shows that the cholesterol level is normal for most of the participants. 75% of the participants have normal cholesterol levels, while 13.5% and 11.5% have above normal and well above normal levels respectively.

5.4 Plotting a box plot between gender and BMI to see if alcohol has an impact on cardiovascular disease.

Observations from Box-plot: In the group of people with cardiovascular disease (right), the box plot suggests that women who consume alcohol are at higher risk than men who consume alcohol, based on their BMI. Women who consume alcohol also appear more prone to heart disease than men who do not. In the group of people without cardiovascular disease (left), women appear healthier than men.

Cluster Analysis

6. Explore Correlations:

Observations from Heatmap: The heatmap shows that the target variable (cardio) is more strongly correlated with systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), age and cholesterol than with the other features.
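The numbers behind such a heatmap come from a correlation matrix. The sketch below builds synthetic data in which cardio depends on ap_hi but not on height, then ranks the features by their correlation with cardio. The data and its parameters are invented for illustration, and the plotting call in the final comment is only one possible way to draw the heatmap.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: cardio is driven by ap_hi plus noise, height is unrelated.
ap_hi = rng.normal(125, 15, n)
df = pd.DataFrame({
    "ap_hi": ap_hi,
    "height": rng.normal(165, 8, n),
    "cardio": (ap_hi + rng.normal(0, 10, n) > 125).astype(int),
})

corr = df.corr()  # Pearson correlation between every pair of columns
ranked = corr["cardio"].drop("cardio").sort_values(ascending=False)
print(ranked)

# Optional plot, assuming seaborn is installed:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```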

7. Step-by-step hierarchical clustering:

Create a dataframe using patient id, ap_hi, ap_lo, cholesterol and target variable(cardio) that can be used for hierarchical clustering. Since the dataset is huge with 68K rows, group patients on ap_hi and ap_lo to create clusters.

Observation from output: Grouping by systolic and diastolic blood pressure yields 981 groups of patients.
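A sketch of the grouping step on made-up rows. The notebook does not say how the other columns were combined within each (ap_hi, ap_lo) group, so taking the mean of cholesterol and cardio is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "ap_hi": [120, 120, 140, 140, 110],
    "ap_lo": [80, 80, 90, 90, 70],
    "cholesterol": [1, 2, 3, 3, 1],
    "cardio": [0, 1, 1, 1, 0],
})

# One row per distinct (ap_hi, ap_lo) pair.
# Assumption: cholesterol and cardio are averaged within each group.
grouped = (
    df.groupby(["ap_hi", "ap_lo"], as_index=False)
      .agg(patients=("id", "count"),
           cholesterol=("cholesterol", "mean"),
           cardio=("cardio", "mean"))
)
print(grouped)
print("number of groups:", len(grouped))
```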

Calculate Euclidean distance using the scipy.spatial library and perform hierarchical clustering.

Observations from output: The Euclidean distances for the 981 groups of patients are calculated as shown above. Since cardio is either 0 or 1, most of the values are either 0 or 1.
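A minimal sketch of the distance calculation with scipy.spatial.distance. The three points are made up, and which columns the notebook passed in is not stated, so using ap_hi, ap_lo and a cardio value here is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up patient groups: columns are ap_hi, ap_lo, cardio (assumed layout).
points = np.array([
    [110.0, 70.0, 0.0],
    [120.0, 80.0, 0.5],
    [140.0, 90.0, 1.0],
])

# pdist returns the condensed form: one distance per unordered pair of rows.
condensed = pdist(points, metric="euclidean")

# squareform expands it to a symmetric n x n matrix with zeros on the diagonal.
dist_matrix = squareform(condensed)
print(dist_matrix.round(2))
```

For n groups the condensed form holds n(n-1)/2 distances, which for 981 groups is 480,690 values.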

Plotting the data to find clusters:

Observation: The plot shows the possibility of having 2 clusters.

8. Dendrogram:

Plotting a dendrogram for the grouped data(on ap_hi and ap_lo):

Observations from dendrogram: The dendrogram shows that the patient groups formed from the blood pressure measures fall into 2 clusters.
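A sketch of how a dendrogram and a 2-cluster cut can be produced with scipy.cluster.hierarchy. The points are made up, and Ward linkage is an assumption, since the notebook does not name the linkage method it used.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Made-up (ap_hi, ap_lo) groups: three low-pressure and three high-pressure.
points = np.array([
    [110.0, 70.0], [115.0, 75.0], [120.0, 80.0],
    [160.0, 100.0], [165.0, 105.0], [170.0, 110.0],
])

# Build the merge tree (assumption: Ward linkage on Euclidean distances).
Z = linkage(points, method="ward")

# no_plot=True returns the tree layout as a dict without drawing it;
# drop the argument to draw the dendrogram with matplotlib.
tree = dendrogram(Z, no_plot=True)

# Cut the tree so that at most 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(tree["ivl"])   # leaf order along the dendrogram axis
print(labels)
```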

Unsupervised Clustering

9. Step-by-step K-means:

Creating function:

Creating a function, kmeans_fun, which takes as arguments the dataframe (df), the number of clusters (k), the number of dimensions (num_dim) and the number of iterations (num_iter) to be performed. The function creates k clusters from the data in df, using num_dim columns for clustering. The flow of the function:

  1. Select and plot k random centroids. Each centroid has num_dim coordinates.
  2. Calculate the distance from each point (with num_dim coordinates) to each of the centroids.
  3. Assign each point to the cluster whose centroid is at minimum distance.
  4. Recenter each centroid at the mean of the points assigned to its cluster.
  5. Repeat steps 2, 3 and 4 up to num_iter times, or until the positions of the centroids no longer change.
  6. If the positions of the centroids stop changing before num_iter iterations are completed, the iterations are stopped and the latest centroids and clusters are taken as final.

The function returns the dataframe by assigning each row to one of the clusters(Assoc) and the latest centroid co-ordinates.
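The flow above can be sketched as follows. This is not the notebook's kmeans_fun. It is a simplified version that works on a NumPy array instead of a dataframe, infers the number of dimensions from the array shape, skips the plotting, and picks the initial centroids from the data points themselves. All of these choices are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, k, num_iter, seed=0):
    """Plain k-means on an (n, num_dim) array; returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)

    def assign(centroids):
        # Steps 2 and 3: distance of every point to every centroid, shape (n, k),
        # then the index of the nearest centroid for each point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        return dists.argmin(axis=1)

    # Step 1: use k distinct data points, chosen at random, as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(num_iter):
        labels = assign(centroids)
        # Step 4: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop early once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return assign(centroids), centroids

# Made-up groups: columns are ap_hi, ap_lo, cholesterol.
data = np.array([
    [110, 70, 1], [112, 72, 1], [111, 71, 1],
    [160, 100, 3], [162, 102, 3], [161, 101, 3],
])
labels, centroids = kmeans_sketch(data, k=2, num_iter=10)
print(labels)
print(centroids)
```

Which cluster receives label 0 depends on the random start, so only the grouping is meaningful, not the label values.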

Call the function created to cluster data on new dataframe with 3 dimensions(num_dim), 2 clusters(k), 10 iterations(num_iter):

Observations from output: The initial random centroids, the number of iterations needed to arrive at the final centroids, the dataframe with each patient group associated with a cluster, and the final centroids are as shown above. Even though we passed 10 as the maximum number of iterations, the function found stable centroids well before 10 iterations.

10. sklearn.cluster:

Create clusters using KMeans function from sklearn.cluster:

Observations from output: A total of 2 clusters are created by the KMeans algorithm. The patient groups in one cluster have comparatively higher mean values of ap_hi and ap_lo than those in the other cluster. The cluster with higher blood pressure values is at a higher risk of heart disease than the patient groups in the other cluster.
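A sketch of the same clustering with sklearn.cluster.KMeans on made-up points. The values n_init=10 and random_state=0 are choices made for this example, not settings taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up patient groups: columns are ap_hi, ap_lo, cholesterol.
X = np.array([
    [110, 70, 1], [112, 72, 1], [111, 71, 1],
    [160, 100, 3], [162, 102, 3], [161, 101, 3],
], dtype=float)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Mean ap_hi and ap_lo per cluster, to see which one holds the higher pressures.
for c in range(2):
    print(c, X[labels == c, :2].mean(axis=0))

# Index of the cluster whose centre has the larger ap_hi.
high_bp_cluster = int(np.argmax(model.cluster_centers_[:, 0]))
print("higher blood pressure cluster:", high_bp_cluster)
```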

Plotting the clusters:

Observations from plot: The KMeans function from sklearn has clustered the patients into 2 clusters, as shown in the plot. Red and blue colours are used to show the patient clusters. From the graph it can be inferred that as the blood pressure values rise, the risk of cardiac disease increases.

Supervised Classification

11. kNN Function:

Creating a function to predict with KNN:

The function takes the number of nearest neighbours, the data point for prediction and the dataframe on which the KNN algorithm should be fitted.

The function is successfully created and can then be used for fitting and predicting the class. To call the function, 3 dimensions (ap_hi, ap_lo and cholesterol) are used to predict cardiac risk.
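A minimal sketch of a k-nearest-neighbours prediction. It is not the notebook's function: the name knn_predict is hypothetical, and a list of (features, label) pairs stands in for the dataframe. The rows are made up.

```python
import math
from collections import Counter

def knn_predict(k, query, rows):
    """Return the majority label among the k rows nearest to the query point.

    rows is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(rows, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training rows: (ap_hi, ap_lo, cholesterol) -> cardio.
rows = [
    ((110, 70, 1), 0), ((115, 75, 1), 0), ((120, 80, 1), 0),
    ((150, 95, 2), 1), ((160, 100, 3), 1), ((170, 110, 3), 1),
]

high = knn_predict(3, (155, 98, 3), rows)
low = knn_predict(3, (112, 72, 1), rows)
print(high, low)
```

Because the features are not scaled here, the blood pressure columns dominate the distance and cholesterol (range 1 to 3) contributes very little. Standardising the columns first is one way to give cholesterol comparable weight.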

Observations from output: The KNN predictions show that cholesterol also plays an important role in increasing the probability of cardiac disease, along with the systolic and diastolic blood pressure measures. As the blood pressure measures or cholesterol levels increase, a patient has a higher risk of cardiac disease.

Plotting cardio vs 3 dimensions(ap_hi, ap_lo and cholesterol)

Observations from the plot: Patient groups with lower values on ap_hi and ap_lo along with cholesterol levels have a healthy heart as compared to the patients with higher values on all 3 variables.

Findings

12. Project Report:

EDA of Cardiovascular diseases

Objective:

We are the founder and CEO of a US cardiovascular disease center, and we want to dedicate our research to giving back to the community and to the well-being of patients. Cardiovascular diseases are very dangerous if not diagnosed in time, and without proper care after diagnosis they can become life-threatening. Our organization will therefore help the public estimate the probability of having heart disease from a patient's features. Our report could include counts for each type of feature related to cardiovascular disease. We plan to study this by grouping the patients into "With" and "Without" heart disease and analysing which features in the given data are most strongly correlated with heart disease.

For this study we use a dataset that describes cardiovascular disease in patients through several attributes. It consists of 70,000 patient records with 12 features, such as age, gender, systolic blood pressure, and diastolic blood pressure. The target class "cardio" equals 1 when the patient has cardiovascular disease and 0 when the patient does not. All values were collected at the time of the medical examination.

Exploring Data:

The average age of the patients in the sample is 19468 days, the average height is 164.35 cm and the average weight is 74 kg. The means of cholesterol and glucose are 1.3 and 1.22 respectively, which are close to 1 ("Normal"), indicating that most of the patients in the sample have normal cholesterol and glucose levels. The means of smoke and alcohol are less than 0.5, revealing that most patients are non-smokers and do not drink alcohol. The mean of active is greater than 0.5, which means most of the patients work out for their health. The mean of cardio is 0.49, close to 0.5, showing that about 50% of the patients in the sample are affected by heart disease. There seem to be abnormally high and low values in ap_hi and ap_lo, which should be taken care of while cleaning the data.

Data Cleaning:

There are no rows with NA values in the dataset. However, since the ap_hi and ap_lo columns had abnormal values, we keep only the rows in the range 30 to 250 for ap_hi and 30 to 150 for ap_lo. After cleaning ap_hi and ap_lo, 68747 rows of data are left, which can be used for data analysis and clustering.

Data Visualization:

Plotting a bar chart to visualize gender vs count of cardiovascular disease:

gender_cardio_bar.png

This graph shows that among participants without heart disease, the count of women is higher than the count of men. Similarly, the count of women is higher than the count of men among participants with heart disease. Hence, predicting cardio from gender alone might not be accurate.

Plotting a pie-chart on percentage of people in each cholesterol group:

chol_pie.png

The pie chart shows that the cholesterol level is normal for most of the participants. 75% of the participants have normal cholesterol levels, while 13.5% and 11.5% have above normal and well above normal levels respectively.

Plotting a box plot between gender and BMI to see if alcohol has an impact on cardiovascular disease:

gen_alco_box.png

In the group of people with cardiovascular disease (right), the box plot suggests that women who consume alcohol are at higher risk than men who consume alcohol, based on their BMI. Women who consume alcohol also appear more prone to heart disease than men who do not. In the group of people without cardiovascular disease (left), women appear healthier than men.

Explore Correlations:

corr_heatmap.png

The heatmap shows that the target variable (cardio) is more strongly correlated with systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), age and cholesterol than with the other features.

Dendrogram:

Plotting a dendrogram for the grouped data(on ap_hi and ap_lo):

dendrogram.png

The dendrogram shows that there can be 2 clusters when patients are grouped by ap_hi and ap_lo.

Unsupervised Clustering(KMeans):

Creating clusters using KMeans algorithm:

image.png

A total of 2 clusters are created by the KMeans algorithm. The patient groups in one cluster have comparatively higher mean values of ap_hi and ap_lo than those in the other cluster. The cluster with higher blood pressure values is at a higher risk of heart disease than the patient groups in the other cluster.

Plotting the clusters from KMeans:

KMeans.png

The KMeans function from sklearn has clustered the patients into 2 clusters, as shown in the plot. Red and blue colours are used to show the patient clusters. From the graph it can be inferred that as the blood pressure values rise, the risk of cardiac disease increases.

Supervised Classification(KNN):

Plotting cardio vs 3 dimensions(ap_hi, ap_lo and cholesterol):

KNN.png

The 3 dimensions(ap_hi, ap_lo and cholesterol) are used to fit KNN to see if the patient has heart disease. Results: Patient groups with lower values on ap_hi and ap_lo along with cholesterol levels have a healthy heart as compared to the patients with higher values on all 3 variables.

Summary:

1) Cardio (whether or not a person has heart disease) is more strongly correlated with systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), age and cholesterol than with the other features.

2) People with high blood pressure are at very high risk of cardiac disease.

3) People with high cholesterol have a higher probability of being affected by cardiovascular disease.

4) Women who consume alcohol have a higher risk of heart disease than men who consume alcohol, based on their BMI.

Originality

"No other similar published works for KMeans and KNN found with same dataset". Every statement in this report is our own work. However we used references for python syntaxing.